With growing awareness of mental health crises and their societal impact, online services offering emergency support have become commonplace in many countries. Computational models trained on discussions between help-seekers and providers can support suicide prevention by identifying at-risk individuals. However, the lack of domain-specific models, especially for low-resource languages, poses a significant challenge for automatic detection of suicide risk. We propose a model that combines a pre-trained language model (PLM) with a fixed set of manually crafted (and clinically approved) suicide cues, followed by a two-stage fine-tuning process. Our model achieves 0.91 ROC-AUC and an F2 score of 0.55, outperforming an array of strong baselines even early in the conversation, which is crucial for real-time detection in this domain. Moreover, the model performs well across genders and age groups.
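The abstract does not specify how the PLM and the fixed cue set are combined, so the following is only a hedged sketch of one plausible first stage: score each message against the cue set with a sentence encoder and pass the per-cue evidence to a small classifier head. The cue phrases, the model name, and the feature design are all illustrative assumptions, not the authors' clinically approved set.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical cue phrases standing in for the clinically approved suicide cues.
CUES = ["I want to end my life", "I feel like a burden", "I have a plan"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in PLM
cue_emb = encoder.encode(CUES, convert_to_tensor=True)

def risk_features(messages):
    """Max cosine similarity of any message to each cue: one feature per cue."""
    msg_emb = encoder.encode(messages, convert_to_tensor=True)
    sims = util.cos_sim(msg_emb, cue_emb)           # (n_messages, n_cues)
    return sims.max(dim=0).values                   # strongest evidence per cue

feats = risk_features(["I can't sleep", "nothing matters anymore"])
# A second stage would fine-tune a classifier head on these features.
```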
This work presents an in-depth analysis of discrete self-supervised speech representations through the lens of Generative Spoken Language Modeling (GSLM). Building on the findings of this analysis, we propose practical improvements to the discrete units used by GSLM. First, we examine the units along three axes: interpretation, visualization, and resynthesis. Our analysis finds a high correlation between the speech units and phonemes or phoneme families, while their correlation with speaker or gender is weaker. Additionally, we found redundancies in the extracted units and claim that one reason may be the units' context. Following this analysis, we propose a new, unsupervised metric to measure unit redundancies. Finally, we use this metric to develop new methods that improve the robustness of unit clustering and show significant improvements on zero-resource speech metrics such as ABX. Code and analysis tools are available under the following link.
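The redundancy metric itself is not given in the abstract; as a hedged illustration, one simple unsupervised statistic in this spirit is the fraction of frames that merely repeat the previous frame's unit, together with the de-duplication it motivates:

```python
import numpy as np

def repetition_rate(units):
    """Fraction of frames whose unit equals the previous frame's unit."""
    units = np.asarray(units)
    return float(np.mean(units[1:] == units[:-1]))

def dedup(units):
    """Collapse consecutive duplicates, e.g. [5, 5, 5, 9, 9] -> [5, 9]."""
    units = np.asarray(units)
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]

print(repetition_rate([5, 5, 5, 9, 9, 2]))   # 0.6: highly redundant sequence
```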
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
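Instruction prompt tuning is described as a parameter-efficient alignment method; the generic mechanism behind such methods is to freeze the LLM and learn only a handful of soft prompt vectors prepended to the input embeddings. Below is a minimal sketch of that mechanism, with gpt2 standing in for PaLM; all names, sizes, and the training setup are illustrative assumptions, not the Med-PaLM recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small stand-in for PaLM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
for p in lm.parameters():
    p.requires_grad = False                          # the base LLM stays frozen

n_prompt = 20                                        # number of soft prompt vectors
soft_prompt = nn.Parameter(torch.randn(n_prompt, lm.config.n_embd) * 0.02)

def forward(text):
    ids = tok(text, return_tensors="pt").input_ids
    tok_emb = lm.transformer.wte(ids)                # (1, T, d) token embeddings
    emb = torch.cat([soft_prompt[None], tok_emb], dim=1)
    return lm(inputs_embeds=emb).logits

opt = torch.optim.Adam([soft_prompt], lr=1e-3)       # only the prompt is trained
```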
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
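Concretely, the two-step design places a discrete-unit bottleneck between recognition and synthesis. The toy skeleton below only illustrates that data flow; the module internals, feature dimensions, and unit vocabulary size are placeholders rather than ReVISE's actual architecture (which initializes P-AVSR from a self-supervised audio-visual model and uses a unit vocoder for P-TTS).

```python
import torch
import torch.nn as nn

class PAVSR(nn.Module):
    """Stand-in pseudo audio-visual speech recognizer: AV features -> units."""
    def __init__(self, feat_dim=512, n_units=200):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_units)
    def forward(self, av_feats):                     # (B, T, feat_dim)
        return self.proj(av_feats).argmax(-1)        # (B, T) discrete units

class PTTS(nn.Module):
    """Stand-in pseudo text-to-speech: discrete units -> waveform samples."""
    def __init__(self, n_units=200, hop=320):
        super().__init__()
        self.emb = nn.Embedding(n_units, 256)
        self.head = nn.Linear(256, hop)
    def forward(self, units):                        # (B, T)
        return self.head(self.emb(units)).flatten(1) # (B, T * hop)

av_feats = torch.randn(1, 50, 512)                   # e.g. corrupted AV features
clean_audio = PTTS()(PAVSR()(av_feats))
```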
Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound as if uttered by a different speaker, while keeping other aspects like content unchanged. Current VC methods focus primarily on spectral features like timbre, while ignoring the unique speaking style of individuals, which often impacts prosody. In this study, we introduce a method for converting not only the timbre, but also prosodic information (i.e., rhythm and pitch changes) to those of the target speaker. The proposed approach is based on a pretrained, self-supervised model for encoding speech into discrete units, which makes it simple, effective, and easy to optimise. We consider the many-to-many setting with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that the proposed approach is significantly superior to the evaluated baselines. Code and samples can be found under https://pages.cs.huji.ac.il/adiyoss-lab/dissc/ .
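The rhythm half of prosody conversion is easy to picture on top of discrete units: strip the source speaker's durations by collapsing repeated units, then re-expand each unit with durations predicted for the target speaker. A hedged sketch, where `predict_duration` stands in for a learned per-speaker duration model (an assumed interface, not the paper's API):

```python
import numpy as np

def resample_rhythm(units, predict_duration):
    """Replace source-speaker rhythm with target-speaker durations."""
    units = np.asarray(units)
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    core = units[keep]                   # content units, stripped of rhythm
    out = [u for u in core for _ in range(max(1, int(predict_duration(u))))]
    return np.asarray(out)

# e.g. stretch every unit to 3 frames for a slow-speaking target speaker
print(resample_rhythm([7, 7, 7, 2, 2, 9], lambda u: 3))
```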
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method supports existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
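The acceptance rule is what guarantees the sampled tokens follow the large model's distribution exactly. Here is a self-contained toy sketch of one round over a small vocabulary, where `p_target` and `p_draft` are callables returning next-token probability vectors; a real implementation would score all draft prefixes with the large model in a single batched forward pass rather than in a Python loop.

```python
import torch

def speculative_sample(p_target, p_draft, gamma=4):
    """One round of speculative decoding; output matches the target model."""
    seq, out = [], []
    # 1) The cheap draft model proposes gamma tokens autoregressively.
    drafts, qs = [], []
    for _ in range(gamma):
        q = p_draft(seq + drafts)                    # (V,) draft probabilities
        drafts.append(torch.multinomial(q, 1).item())
        qs.append(q)
    # 2) The target model scores every draft prefix (batched in practice).
    ps = [p_target(seq + drafts[:i]) for i in range(gamma + 1)]
    # 3) Accept each draft token with prob min(1, p/q); else resample residual.
    for i, x in enumerate(drafts):
        if torch.rand(()) < min(1.0, (ps[i][x] / qs[i][x]).item()):
            out.append(x)
        else:
            resid = torch.clamp(ps[i] - qs[i], min=0)
            out.append(torch.multinomial(resid / resid.sum(), 1).item())
            return out                               # stop at first rejection
    # 4) All gamma tokens accepted: one extra token from the target is free.
    out.append(torch.multinomial(ps[gamma], 1).item())
    return out
```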
Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem. Empirically, our simple method improves generalization on many visual benchmarks for distribution shifts. Theoretically, we characterize these improvements in terms of the bias-variance trade-off.
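The per-input loop is simple to state: copy the model, take a few gradient steps on its masked-reconstruction loss for the single test input, then predict. A minimal sketch, where `mae_loss` and `classify` are assumed interfaces for a masked-autoencoder backbone with a classification head, not a real API:

```python
import copy
import torch

def test_time_train(model, x, steps=10, lr=1e-4):
    """Adapt a copy of `model` to one test input via self-supervision."""
    m = copy.deepcopy(model)             # never contaminate the base model
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    for _ in range(steps):
        loss = m.mae_loss(x)             # reconstruct masked patches of x
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return m.classify(x)             # predict with the adapted copy
```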
How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce an output image consistent with the given examples. We show that posing this problem as simple image inpainting, literally just filling in a hole in a concatenated visual prompt image, turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked autoencoders on a new dataset that we curated: 88k unlabeled figures from academic paper sources on arXiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.
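The prompt construction itself is just image concatenation: place the example input-output pair on the top row, the query on the bottom-left, and leave the bottom-right cell as the hole for the inpainting model to fill. A small sketch of that layout (array shapes and the zero-filled hole are assumptions of this illustration):

```python
import numpy as np

def build_visual_prompt(ex_in, ex_out, query):
    """2x2 grid prompt; the bottom-right quadrant is the hole to inpaint.
    All images are HxWx3 float arrays of the same size."""
    h, w, _ = query.shape
    hole = np.zeros((h, w, 3))                       # region the MAE fills in
    top = np.concatenate([ex_in, ex_out], axis=1)
    bottom = np.concatenate([query, hole], axis=1)
    canvas = np.concatenate([top, bottom], axis=0)   # (2H, 2W, 3)
    mask = np.zeros(canvas.shape[:2], dtype=bool)
    mask[h:, w:] = True                              # inpaint only this cell
    return canvas, mask
```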
Convolutional neural networks contain strong priors for generating natural images [1]. These priors enable image denoising, super-resolution, and inpainting in an unsupervised manner. Previous attempts to demonstrate similar ideas in audio, namely deep audio priors, (i) use hand-picked architectures such as harmonic convolutions, (ii) only work with spectral input, and (iii) have mostly been used to remove Gaussian noise [2]. In this work, we show that existing SOTA architectures for audio source separation contain deep priors even when working with the raw waveform. A deep prior can be discovered by training a neural network to generate a single corrupted signal given white noise as input. A network with a relevant deep prior is likely to generate a cleaner version of the signal before converging on the corrupted one. We demonstrate this restoration effect with several corruptions: background noise, reverberation, and gaps in the signal (audio inpainting).
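The procedure fits a network to reproduce the single corrupted waveform from a fixed white-noise input and keeps early iterates, which tend to be cleaner than the final fit. The sketch below uses a toy 1-D conv stack purely for illustration; the paper's point is that full source-separation architectures exhibit this prior on raw waveforms.

```python
import torch
import torch.nn as nn

def deep_prior_restore(corrupted, steps=500, lr=1e-3):
    """Fit a net to one corrupted waveform; early outputs are often cleaner."""
    T = corrupted.numel()
    z = torch.randn(1, 1, T)                         # fixed white-noise input
    net = nn.Sequential(
        nn.Conv1d(1, 32, 15, padding=7), nn.ReLU(),
        nn.Conv1d(32, 32, 15, padding=7), nn.ReLU(),
        nn.Conv1d(32, 1, 15, padding=7),
    )
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    snapshots = []
    for step in range(steps):
        out = net(z)
        loss = (out - corrupted.view(1, 1, T)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 100 == 0:                          # keep early, cleaner iterates
            snapshots.append(out.detach().squeeze())
    return snapshots
```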
External eye photographs were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate whether external eye photographs contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photographs as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, urine ACR); bone and mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening at 11 sites in Los Angeles county, California. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening at 3 independent sites in Los Angeles county, California, and the greater Atlanta area. We compared against baseline models incorporating available clinicodemographic variables (e.g., age, sex, race/ethnicity, years with diabetes). Relative to baseline, the DLS achieved statistically significantly superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of the DLS exceeded that of the baseline by 5.2-19.4%. On validation sets B and C, whose patient populations differed substantially from the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photographs contain biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact.
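Architecturally, predicting many threshold outcomes from one photograph is a standard multi-task setup: a shared image backbone with one sigmoid head per parameter. The toy sketch below assumes a ResNet-18 backbone purely for illustration; the abstract does not specify this design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Thresholds from the abstract; backbone and head design are illustrative.
TARGETS = ["AST>36", "calcium<8.6", "eGFR<60", "Hgb<11",
           "platelets<150", "ACR>=300", "WBC<4"]

class EyeDLS(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)       # stand-in image encoder
        self.backbone.fc = nn.Identity()             # expose 512-d features
        self.heads = nn.ModuleList(nn.Linear(512, 1) for _ in TARGETS)
    def forward(self, photo):                        # (B, 3, H, W) eye photo
        z = self.backbone(photo)
        return {t: torch.sigmoid(h(z)).squeeze(-1)
                for t, h in zip(TARGETS, self.heads)}

probs = EyeDLS()(torch.randn(2, 3, 224, 224))        # one probability per target
```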